Gender Wage Gap

Satya Gatiganti

Fall 2016

A ubiquitous issue in modern society is the gender wage gap. This project analyzes the issue by observing how wage gaps differ between countries around the world and by asking which political and economic factors might contribute to a country's wage gap. It then takes a closer look at the gender wage gap in the United States and how it changes across professions and across demographic factors such as race. There are innumerable possible reasons for the gender wage gap; this project focuses on a few interesting factors that could be correlated with it.

Part One - Comparing the gender wage gap between countries

This first part of the analysis uses two data sources, https://data.oecd.org/earnwage/gender-wage-gap.htm and www.ilo.org/gwr-figures, to construct visual depictions of the gender wage gap around the world.

Part Two - What factors affect the gender wage gap in a global context?

Women's Political and Economic Rights:

Human Development Index:

GDP Per Capita:

Part Three - Segmenting the Gender Wage Gap in the US

US Wage Gap Segmented by Profession:

US Wage Gap Segmented by Race:

Importing Packages

I first imported all packages necessary to extract my data and to plot my graphs.


In [4]:
%matplotlib inline
import sys
import pandas as pd
import pandas_datareader
import matplotlib.pyplot as plt
import numpy as np
import datetime as dt
import pycountry
import seaborn.apionly as sns

from plotly.offline import iplot, iplot_mpl  # plotting functions
import plotly.graph_objs as go               # ditto
import plotly                                # just to print version and init notebook
import cufflinks as cf                       # gives us df.iplot that feels like df.plot
cf.set_config_file(offline=True, offline_show_link=False)

import plotly.plotly as py

# these lines make our graphics show up in the notebook
%matplotlib inline             
plotly.offline.init_notebook_mode(connected=True)
print('Python version: ', sys.version)
print('Pandas version: ', pd.__version__)
print('Today: ', dt.date.today())


Python version:  3.5.2 |Anaconda 4.2.0 (x86_64)| (default, Jul  2 2016, 17:52:12) 
[GCC 4.2.1 Compatible Apple LLVM 4.2 (clang-425.0.28)]
Pandas version:  0.19.0
Today:  2016-12-22

Part One

Cleaning Up the Data

The first data source (https://data.oecd.org/earnwage/gender-wage-gap.htm) reports the gender wage gap, in percentage terms, for 26 countries. The second data source (www.ilo.org/gwr-figures) covers many of the same countries but also includes developing countries that the first lacks. I will use both data sources in my analysis.

Data Source - OECD


In [5]:
url = "https://stats.oecd.org/sdmx-json/data/DP_LIVE/.WAGEGAP.../OECD?contentType=csv&detail=code&separator=comma&csv-lang=en"
wg = pd.read_csv(url)  
wg
wg = wg[wg.SUBJECT!= 'SELFEMPLOYED'] #Only interested in total employment, and not just self employment
wg = wg[wg.TIME == 2010] #Only looking at the wage gaps for the most recent year: 2010
# Drop all columns that I do not need
wg = wg.drop(['INDICATOR', 'MEASURE', 'FREQUENCY', 'Flag Codes', 'SUBJECT', 'TIME'], axis=1)
wg.columns = ["ISO", "Gender Wage Gap in % Difference"]
wg.head()


Out[5]:
ISO Gender Wage Gap in % Difference
34 AUS 14.042933
49 AUT 19.188862
65 BEL 7.043796
83 CAN 18.977470
101 CZE 15.798503

Data Source - ILO

The ILO spreadsheet contains only country names, not ISO codes. I used the following code to replace the country names with their respective ISO codes.


In [6]:
file2 = 'data/GWG.xlsx'
gw = pd.read_excel(file2, encoding='latin-1')
gw = gw.drop('Explained wage gap', axis =1)
gw.columns = ['Country', 'Gender Wage Gap in % Difference']
gw 

import csv

dic={}


# with open("wikipedia-iso-country-codes.csv") as f:
#     file= csv.DictReader(f, delimiter=',')
#     for line in file:
#         dic[line['English short name lower case']]=line['Alpha-3 code']
        

# countries=gw['Country']

# [dic[x] for x in countries]

#Copied and pasted the values that were obtained from the line of code above 
  
gw['ISO'] = ['USA',
 'IRL',
 'GBR',
 'EST',
 'ISL',
 'CZE',
 'CYP',
 'NOR',
 'AUT',
 'NLD',
 'DEU',
 'GRC',
 'SVK',
 'BEL',
 'FIN',
 'BGR',
 'FRA',
 'ITA',
 'ESP',
 'LUX',
 'DNK',
 'LVA',
 'ROU',
 'PRT',
 'HUN',
 'POL',
 'SVN',
 'LTU',
 'SWE',
 'RUS',
 'ARG',
 'URY',
 'BRA',
 'CHL',
 'CHN',
 'PER',
 'MEX',
 'VNM', 
 'IND']

gw = gw[['ISO', 'Country', 'Gender Wage Gap in % Difference']]

gw = gw.drop('Country', axis =1)
gw.head()


Out[6]:
ISO Gender Wage Gap in % Difference
0 USA 35.79502
1 IRL 29.10423
2 GBR 29.05460
3 EST 28.94645
4 ISL 27.83397
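
The hard-coded list above depends on the spreadsheet rows staying in the same order. A minimal sketch of an order-independent alternative is to map each name through a lookup table and then check that nothing was left unmapped. The country spellings and the `name_to_iso` dictionary below are assumptions for illustration; the wage gap values are the five shown in Out[6].

```python
import pandas as pd

# Illustrative frame: gap values from Out[6]; the country spellings are assumed
gw_demo = pd.DataFrame({
    'Country': ['United States', 'Ireland', 'United Kingdom', 'Estonia', 'Iceland'],
    'Gender Wage Gap in % Difference': [35.79502, 29.10423, 29.05460, 28.94645, 27.83397],
})

# Hypothetical lookup table; in practice it would be built from an ISO code list
name_to_iso = {
    'United States': 'USA',
    'Ireland': 'IRL',
    'United Kingdom': 'GBR',
    'Estonia': 'EST',
    'Iceland': 'ISL',
}

gw_demo['ISO'] = gw_demo['Country'].map(name_to_iso)

# Names missing from the lookup table become NaN, so they are easy to list
unmapped = gw_demo.loc[gw_demo['ISO'].isnull(), 'Country'].tolist()
print(unmapped)
```

An empty list means every country received a code; any name that appears here would need a manual entry in the lookup table.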

Merging the two data sets together

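I combined the two data sets using the code below and dropped all duplicated countries. Because the OECD rows come first in the concatenation and drop_duplicates keeps the first occurrence by default, the OECD figure is kept whenever a country appears in both sources.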


In [7]:
combination = pd.concat([wg, gw,], axis = 0)
combination = combination.drop_duplicates('ISO')
combination.head()


Out[7]:
ISO Gender Wage Gap in % Difference
34 AUS 14.042933
49 AUT 19.188862
65 BEL 7.043796
83 CAN 18.977470
101 CZE 15.798503
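
To make the effect of the de-duplication concrete, here is a small sketch using two countries that appear in both sources. The OECD figures are the ones that survive into Out[15] and the ILO figures come from Out[6]; the frame names `oecd_demo` and `ilo_demo` are made up for this example.

```python
import pandas as pd

col = 'Gender Wage Gap in % Difference'

# Figures for the same two countries as reported in the notebook outputs
oecd_demo = pd.DataFrame({'ISO': ['USA', 'IRL'], col: [18.810680, 12.844729]})
ilo_demo = pd.DataFrame({'ISO': ['USA', 'IRL'], col: [35.79502, 29.10423]})

# Stack the two sources, OECD first, then keep one row per country
stacked = pd.concat([oecd_demo, ilo_demo], axis=0)
deduped = stacked.drop_duplicates('ISO')   # default keep='first'
print(deduped)
```

The two sources can disagree substantially for the same country, so the order of the frames passed to `pd.concat` decides which figure ends up on the map.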

Mapping out the Gender Wage Gaps in the World


In [8]:
layout = dict(geo={"scope": "world", "resolution": 150}, 
        title = 'Gender Wage Gaps',
         width=750, height=550)

In [9]:
trace = dict(type="choropleth",
             locations=combination["ISO"],   # use ISO names
             z=combination["Gender Wage Gap in % Difference"], # defines the color, 
             colorscale=[[0,"rgb(5, 10, 172)"],[0.35,"rgb(40, 60, 190)"],[0.5,"rgb(70, 100, 245)"],\
            [0.6,"rgb(90, 120, 245)"],[0.7,"rgb(106, 137, 247)"],[1,"rgb(220, 220, 220)"]],  
             text=combination.index,     
             colorbar = dict(title = 'Gender Wage Gap in % Difference between Men and Women'),
            ) 
             


iplot(go.Figure(data=[trace], layout=layout), link_text="")


Looking at the map, we can make several general observations about how gender wage gaps differ across regions. Europe in general seems to have lower wage gaps than the rest of the world. South America and Eastern Asia both show generally higher wage gaps, as the two regions largely consist of developing countries. These developing countries could place more limits on the advancement of women in the workplace, or may have only recently begun to encourage it. On this map, Korea has the highest gap at 39% while Sweden has the lowest at 4%.

Part 2 - CIRI Data

I used the CIRI data set to retrieve scores for women's political and economic rights in each country in order to correlate them with the country's gender wage gap. I extracted only the variables I need: the country, the year, and the scores for women's economic and political rights. Again, I needed to convert the country names into ISO codes, so I used the code below to do so. A score of 1.0 indicates limited rights while a score of 3.0 indicates full rights. I expect countries with weaker economic and political rights to have higher wage gaps.


In [7]:
url = "http://unstats.un.org/unsd/methods/m49/m49alpha.htm"
iso = pd.read_html(url, attrs={"border": "0", "cellpadding": "2"}, header=0)[0]
iso = iso.rename(columns={"ISO ALPHA-3 code": "ISO", 
                          "Country or area name": "Country"})
iso = iso.drop("Numerical  code", axis=1)
iso = iso.set_index("Country")
iso.head()


Out[7]:
ISO
Country
Afghanistan AFG
Åland Islands ALA
Albania ALB
Algeria DZA
American Samoa ASM

In [8]:
#Women's Political and Economic Rights
url1 = 'https://drive.google.com/uc?export=download&id=0BxDpF6GQ-6fbbEdZYmRXekhGMFE'
hr = pd.read_csv(url1)
hr = hr[['CTRY', 'YEAR', 'WECON', 'WOPOL']]
hr.columns = ['Country', 'Year', 'Economic Rights', 'Political Rights']
hr = hr[hr.Year == 2011] #Only need data from the most recent year, which is 2011
hr.head()


Out[8]:
Country Year Economic Rights Political Rights
30 Afghanistan 2011 0.0 2.0
61 Albania 2011 1.0 2.0
92 Algeria 2011 1.0 2.0
123 Andorra 2011 3.0 3.0
154 Angola 2011 1.0 3.0

In [9]:
iso_hr = iso[iso.index.isin(hr['Country'])]
iso_hr.head()


Out[9]:
ISO
Country
Afghanistan AFG
Albania ALB
Algeria DZA
Andorra AND
Angola AGO

In [10]:
hr = hr.set_index('Country')

In [11]:
hr = hr.merge(iso, left_index=True, right_index=True)
hr.head()


Out[11]:
Year Economic Rights Political Rights ISO
Country
Afghanistan 2011 0.0 2.0 AFG
Albania 2011 1.0 2.0 ALB
Algeria 2011 1.0 2.0 DZA
Andorra 2011 3.0 3.0 AND
Angola 2011 1.0 3.0 AGO
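
Merging on country names silently drops any country whose spelling differs between the two sources. A rough way to check for this, sketched below, is a left join with `indicator=True`. The two small frames are hypothetical (the mismatched Korea spellings and its score are invented for illustration); only the Albania and Andorra scores come from Out[8].

```python
import pandas as pd

# Hypothetical example: one spelling differs between the two sources
scores = pd.DataFrame({'Economic Rights': [1.0, 3.0, 2.0]},
                      index=['Albania', 'Andorra', 'Korea, Republic of'])
codes = pd.DataFrame({'ISO': ['ALB', 'AND', 'KOR']},
                     index=['Albania', 'Andorra', 'Republic of Korea'])

# A left join keeps every row of `scores`; the _merge column records
# whether a matching country name was found in `codes`
checked = scores.merge(codes, left_index=True, right_index=True,
                       how='left', indicator=True)
unmatched = checked[checked['_merge'] == 'left_only'].index.tolist()
print(unmatched)
```

Any names printed here would be lost by the default inner merge and could be renamed by hand before merging.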

In [12]:
hr = hr.set_index('ISO')

In [14]:
combination = combination.set_index('ISO')

Gender Wage Gap in % Difference vs Economic Rights Swarmplot


In [15]:
combination1 = combination.merge(hr, left_index=True, right_index=True)
combination1


Out[15]:
Gender Wage Gap in % Difference Year Economic Rights Political Rights
ISO
ARG 27.200000 2011 2.0 3.0
AUS 14.042933 2011 3.0 2.0
AUT 19.188862 2011 3.0 3.0
BEL 7.043796 2011 2.0 3.0
BRA 24.351110 2011 1.0 2.0
BGR 18.323460 2011 2.0 2.0
CAN 18.977470 2011 2.0 2.0
CHL 23.231090 2011 1.0 2.0
CHN 22.893980 2011 1.0 2.0
COL 6.430555 2011 1.0 2.0
CYP 25.660240 2011 1.0 2.0
DNK 8.895099 2011 3.0 3.0
EST 26.601547 2011 2.0 2.0
FIN 18.876999 2011 3.0 3.0
FRA 14.054337 2011 2.0 2.0
DEU 16.750813 2011 3.0 3.0
GRC 12.172841 2011 2.0 2.0
HUN 6.381712 2011 1.0 2.0
ISL 14.314784 2011 3.0 3.0
IND 24.800000 2011 1.0 3.0
IRL 12.844729 2011 3.0 2.0
ISR 20.700092 2011 1.0 2.0
ITA 9.940335 2011 3.0 2.0
JPN 28.684301 2011 1.0 2.0
LVA 13.333333 2011 1.0 2.0
LTU 6.956522 2011 2.0 2.0
LUX 4.968271 2011 3.0 3.0
MEX 11.627907 2011 2.0 2.0
NLD 18.597561 2011 3.0 3.0
NZL 7.011236 2011 3.0 3.0
NOR 8.059702 2011 3.0 3.0
PER 22.629050 2011 1.0 2.0
POL 7.190207 2011 1.0 2.0
PRT 13.450867 2011 2.0 2.0
ROU 15.370390 2011 1.0 2.0
SVN 11.633663 2011 2.0 2.0
ESP 6.585059 2011 2.0 3.0
SWE 14.321260 2011 2.0 3.0
CHE 20.053595 2011 3.0 2.0
TUR 20.064724 2011 1.0 2.0
USA 18.810680 2011 3.0 2.0
URY 27.168630 2011 2.0 2.0

In [16]:
ax = sns.swarmplot(x="Economic Rights", y="Gender Wage Gap in % Difference", data=combination1)
fig_mpl = ax.get_figure()


I used a swarmplot for this data set because the countries in the sample take only three values on the scale of women's economic rights: 1.0, 2.0, or 3.0. At a score of 1.0, there is a larger cluster of countries with higher gender wage gaps, between 20% and 25%. Examples include Brazil, Peru, and China, whose gender wage gaps are roughly 24%, 23%, and 23% respectively. Women's economic rights include equal pay for equal work, job security, and equality in hiring and promotion practices. This pattern is consistent with the current economic environment of developing countries. The outliers include Poland, Hungary, and Colombia, all of which have relatively low wage gaps despite a score of 1.0. It is important to keep in mind that a score of 1.0 means economic rights exist under law but are not necessarily enforced strictly by the government, which could account for a low score alongside a low wage gap. In addition, the economic score encompasses less commonly considered factors, such as the right to work in occupations classified as dangerous and the right to work at night.

Less of a distinction can be made between a score of 2 and a score of 3. Overall, though, a trend can be noted in which women's economic rights appear to be associated with the size of the wage gap.


In [17]:
fig, ax = plt.subplots()
ax.scatter(combination1['Political Rights'],                 # x variable
           combination1['Gender Wage Gap in % Difference'],  # y variable
           alpha=0.5)                                        # marker transparency
ax.set_title('Gender Wage Gap vs Political Rights', loc='left', fontsize=14)
ax.set_xlabel('Political Rights')
ax.set_ylabel('Gender Wage Gap in % Difference')


Out[17]:
<matplotlib.text.Text at 0x11d5bd198>

Political rights include a woman's right to vote, hold political office, and petition government officials. At a score of 3.0, there is a larger cluster of countries with lower gender wage gaps, between 5% and 10%. However, the relationship between the political rights score and the gender wage gap is less pronounced than for economic rights. Perhaps a country's score for women's political rights does not necessarily correlate with its score for economic rights?

Correlation between women's political rights and economic rights


In [19]:
np.corrcoef(hr['Economic Rights'], hr['Political Rights'])[0, 1]


Out[19]:
0.98981153519249365

The correlation coefficient between women's political and economic rights, computed over all countries in the 2011 CIRI data, is very high (about 0.99), so the two scores tend to move together. Even so, in the plots above a country's political rights score is less indicative of its gender wage gap than its economic rights score.
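
A coefficient this close to 1 is worth double-checking. One possible cause, which is an assumption on my part and not something visible in the output above, is that the raw file stores missing observations as large negative codes such as -999, which would dominate the calculation. The sketch below uses made-up scores to show how much a single such row can inflate the coefficient; the helper `rights_correlation` is hypothetical.

```python
import numpy as np
import pandas as pd

def rights_correlation(df, x='Economic Rights', y='Political Rights'):
    """Pearson correlation of two score columns, ignoring rows with negative codes."""
    valid = df[(df[x] >= 0) & (df[y] >= 0)]
    return np.corrcoef(valid[x], valid[y])[0, 1]

# Made-up scores; the -999 row stands in for an assumed missing-value code
demo = pd.DataFrame({'Economic Rights': [1.0, 2.0, 3.0, -999.0, 2.0],
                     'Political Rights': [2.0, 2.0, 3.0, -999.0, 3.0]})

naive = np.corrcoef(demo['Economic Rights'], demo['Political Rights'])[0, 1]
filtered = rights_correlation(demo)
print(round(naive, 4), round(filtered, 4))
```

If the real data contain no negative codes, the filtered and unfiltered coefficients will simply agree.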

Average gender wage gaps for countries with a political rights score of 2 and 3

In this section, the mean gender wage gap is calculated for countries with a political rights score of 2 and for countries with a score of 3, to check whether a lower political rights score goes with a higher gender wage gap.


In [130]:
# Gender Wage Gap in % Difference when Political Rights is 2.0 
scoretwopolitical = [2.0]
political2 = combination1[combination1['Political Rights'].isin(scoretwopolitical)].drop('Economic Rights', axis =1).drop('Year', axis =1)
political2.head()


Out[130]:
Gender Wage Gap in % Difference Political Rights
ISO
AUS 14.042933 2.0
BRA 24.351110 2.0
BGR 18.323460 2.0
CAN 18.977470 2.0
CHL 23.231090 2.0

In [131]:
politicalrights2 = political2['Gender Wage Gap in % Difference']
politicalrights2.mean()


Out[131]:
17.301709266666666

In [128]:
# Gender Wage Gap in % Difference when Political Rights is 3.0 
scorethreepolitical = [3.0]
political3 = combination1[combination1['Political Rights'].isin(scorethreepolitical)].drop(['Economic Rights', 'Year'], axis=1)
political3.head()


Out[128]:
Gender Wage Gap in % Difference Political Rights
ISO
ARG 27.200000 3.0
AUT 19.188862 3.0
BEL 7.043796 3.0
DNK 8.895099 3.0
FIN 18.876999 3.0

In [132]:
politicalrights3 = political3['Gender Wage Gap in % Difference']
politicalrights3.mean()


Out[132]:
14.043817242857141

The hypothesis that a lower political rights score would indicate a higher gender wage gap seems to be consistent with the calculated means (about 17.3% for a score of 2 versus 14.0% for a score of 3). However, it is important to keep in mind that the sample of countries with a score of 3 is rather small.
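
The two filtered frames above can also be produced in one step with a groupby, which reports the group sizes alongside the means. The sketch below runs on only the ten rows shown in Out[130] and Out[128], so its means differ from the full-sample figures; on the real data the same call would be applied to combination1.

```python
import pandas as pd

gap = 'Gender Wage Gap in % Difference'

# The ten rows shown in Out[130] and Out[128]; the full sample is combination1
sample = pd.DataFrame(
    {gap: [14.042933, 24.351110, 18.323460, 18.977470, 23.231090,
           27.200000, 19.188862, 7.043796, 8.895099, 18.876999],
     'Political Rights': [2.0] * 5 + [3.0] * 5},
    index=['AUS', 'BRA', 'BGR', 'CAN', 'CHL', 'ARG', 'AUT', 'BEL', 'DNK', 'FIN'])

# Mean gap and number of countries for each political rights score
summary = sample.groupby('Political Rights')[gap].agg(['mean', 'count'])
print(summary)
```

Seeing the count next to each mean makes it easier to judge how much weight the comparison can bear.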

Part 2 - HDI Data

For this data set, I will compare the Human Development Index (HDI) of a country with its gender wage gap. The HDI is a composite statistic of life expectancy, education, and per capita income indicators, which is used to rank countries into four tiers of human development. I expect countries with a lower HDI to have higher wage gaps. As a first step, I convert the country names to their ISO codes.


In [20]:
url = "http://unstats.un.org/unsd/methods/m49/m49alpha.htm"
iso = pd.read_html(url, attrs={"border": "0", "cellpadding": "2"}, header=0)[0]
iso = iso.rename(columns={"ISO ALPHA-3 code": "ISO", 
                          "Country or area name": "Country"})
iso = iso.drop("Numerical  code", axis=1)
iso = iso.set_index("Country")
iso.head()


Out[20]:
ISO
Country
Afghanistan AFG
Åland Islands ALA
Albania ALB
Algeria DZA
American Samoa ASM

In [21]:
file2 = 'data/2015_Statistical_Annex_Table_1.xls'
hdi = pd.read_excel(file2)
hdi = hdi[['Country','Human Development Index']]
hdi.head()


Out[21]:
Country Human Development Index
0 Norway 0.943877
1 Australia 0.934958
2 Switzerland 0.929613
3 Denmark 0.923328
4 Netherlands 0.921794

In [22]:
iso_hdi = iso[iso.index.isin(hdi['Country'])]

iso_hdi.head()


Out[22]:
ISO
Country
Afghanistan AFG
Albania ALB
Algeria DZA
Andorra AND
Angola AGO

In [23]:
hdi = hdi.set_index('Country')

In [24]:
hdi = hdi.merge(iso, left_index=True, right_index=True)
hdi.head()


Out[24]:
Human Development Index ISO
Country
Afghanistan 0.465264 AFG
Albania 0.732766 ALB
Algeria 0.735624 DZA
Andorra 0.844642 AND
Angola 0.531591 AGO

In [25]:
hdi = hdi.set_index('ISO')

In [26]:
combination2 = combination.merge(hdi, left_index=True, right_index=True)
combination2.head()


Out[26]:
Gender Wage Gap in % Difference Human Development Index
ISO
ARG 27.200000 0.835572
AUS 14.042933 0.934958
AUT 19.188862 0.885027
BEL 7.043796 0.890263
BRA 24.351110 0.755292

In [27]:
fig, ax = plt.subplots()
ax.scatter(combination2['Human Development Index'],          # x variable
           combination2['Gender Wage Gap in % Difference'],  # y variable
           alpha=0.5)                                        # marker transparency
ax.set_title('Gender Wage Gap vs HDI', loc='left', fontsize=14)
ax.set_xlabel('HDI')
ax.set_ylabel('Gender Wage Gap in % Difference')


Out[27]:
<matplotlib.text.Text at 0x11dc6c0b8>

In [29]:
np.corrcoef(combination2['Gender Wage Gap in % Difference'], combination2['Human Development Index'])[0, 1]


Out[29]:
-0.29077952683532698

Analysis - HDI Data

Taking the scatter plot and the correlation coefficient (-0.29) together, there is a weak negative correlation between a country's gender wage gap and its HDI. Perhaps the HDI combines too many broad factors, each of which has a different relation with the gender wage gap.
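
For readers who want to see what np.corrcoef is computing, here is a plain-Python sketch of the Pearson coefficient, applied to the five rows shown in Out[26]. The function name `pearson_r` is my own, and the value for these five rows is not comparable to the full-sample figure of -0.29.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# The five rows shown in Out[26] (ARG, AUS, AUT, BEL, BRA)
gap = [27.200000, 14.042933, 19.188862, 7.043796, 24.351110]
hdi = [0.835572, 0.934958, 0.885027, 0.890263, 0.755292]

r = pearson_r(gap, hdi)
print(round(r, 2))
```

The sign is negative here as well, meaning that in these rows a higher HDI goes with a smaller wage gap.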

Part 2 - GDP Per Capita Data

For this data set, I will compare the GDP per capita of a country with its gender wage gap. GDP per capita is an indicator of a country's economic activity as well as the purchasing power of its residents. It reflects living standards and how advanced a particular economy may be. I would expect countries with lower GDP per capita to have higher gender wage gaps. As a first step, I convert the country names to their ISO codes.


In [44]:
url = "http://unstats.un.org/unsd/methods/m49/m49alpha.htm"
iso = pd.read_html(url, attrs={"border": "0", "cellpadding": "2"}, header=0)[0]
iso = iso.rename(columns={"ISO ALPHA-3 code": "ISO", 
                          "Country or area name": "Country"})
iso = iso.drop("Numerical  code", axis=1)
iso = iso.set_index("Country")
iso.head()


Out[44]:
ISO
Country
Afghanistan AFG
Åland Islands ALA
Albania ALB
Algeria DZA
American Samoa ASM

In [45]:
# World Bank Data 
file2 = '/Users/satyagatiganti/Desktop/Data_Bootcamp/WDI_Data.csv'
wb = pd.read_csv(file2, encoding = 'latin-1')
wb = wb.rename(columns={'Indicator Name': 'Factor'})
vlist = ['GDP per capita (constant 2010 US$)']
wb = wb[wb['Factor'].isin(vlist)]
wb = wb[['Country Name', 'Country Code', 'Factor', '2015']]
wb.columns = ['Country', 'ISO', 'Factor', 'GDP Per Capita']
wb = wb.drop('Factor', axis=1)
wb.head(5)


Out[45]:
Country ISO GDP Per Capita
497 Arab World ARB 6400.320671
1943 Caribbean small states CSS 9004.192587
3389 Central Europe and the Baltics CEB 14163.027829
4835 Early-demographic dividend EAR 3339.790563
6281 East Asia & Pacific EAS 9041.731421

In [46]:
iso_wb = iso[iso.index.isin(wb['Country'])]

iso_wb.head()


Out[46]:
ISO
Country
Afghanistan AFG
Albania ALB
Algeria DZA
American Samoa ASM
Andorra AND

In [47]:
wb = wb.set_index('Country')

In [48]:
wb = wb.merge(iso, left_index=True, right_index =True)
wb.head()


Out[48]:
ISO_x GDP Per Capita ISO_y
Country
Afghanistan AFG 623.925524 AFG
Albania ALB 4541.386209 ALB
Algeria DZA 4794.048900 DZA
American Samoa ASM NaN ASM
Andorra ADO NaN AND

In [49]:
wb = wb.set_index('ISO_x')

In [51]:
combination3 = combination.merge(wb, left_index=True, right_index=True)
combination3.head()


Out[51]:
Gender Wage Gap in % Difference GDP Per Capita ISO_y
ARG 27.200000 10514.587895 ARG
AUS 14.042933 54717.706705 AUS
AUT 19.188862 47667.805610 AUT
BEL 7.043796 44863.088182 BEL
BRA 24.351110 11159.254155 BRA

In [52]:
combination3 = combination3.sort_values(by='GDP Per Capita', ascending=True)
combination3.head()


Out[52]:
Gender Wage Gap in % Difference GDP Per Capita ISO_y
IND 24.800000 1805.579625 IND
PER 22.629050 5974.477245 PER
CHN 22.893980 6416.183355 CHN
COL 6.430555 7447.779147 COL
BGR 18.323460 7502.436143 BGR

In [53]:
fig, ax = plt.subplots(figsize = (20,8))  
combination3['Gender Wage Gap in % Difference'].plot(kind = 'bar', ax=ax)
ax.set_title('Gender Wage Gap by GDP Per Capita', loc='center', fontsize=14)
ax.set_xlabel('Countries in Ascending GDP Per Capita', fontsize = 12)
ax.set_ylabel('Gender Wage Gap in % Difference', fontsize = 12)


Out[53]:
<matplotlib.text.Text at 0x1177661d0>

The countries on the x-axis of the graph above are listed in ascending order of GDP per capita. Looking at the graph, there appears to be little correlation between GDP per capita and the gender wage gap. As with the HDI, perhaps GDP per capita is too broad a factor to use in evaluating the gender wage gap.
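
Since the bar chart orders countries by GDP per capita, a rank-based coefficient is a natural way to put a number on what the chart shows. This is a different technique from the Pearson coefficient used in the other sections: the sketch below computes a Spearman rank correlation as the Pearson correlation of the ranked values. It runs on the five rows shown in Out[52] only, so the result says nothing about the full sample; the helper `rank_correlation` is hypothetical.

```python
import pandas as pd

# The five lowest-GDP rows shown in Out[52] (IND, PER, CHN, COL, BGR)
demo = pd.DataFrame(
    {'GDP Per Capita': [1805.579625, 5974.477245, 6416.183355,
                        7447.779147, 7502.436143],
     'Gender Wage Gap in % Difference': [24.800000, 22.629050, 22.893980,
                                         6.430555, 18.323460]},
    index=['IND', 'PER', 'CHN', 'COL', 'BGR'])

def rank_correlation(df, x, y):
    """Spearman rank correlation: Pearson correlation of the ranked values."""
    pairs = df[[x, y]].dropna()          # skip countries with missing values
    return pairs[x].rank().corr(pairs[y].rank())

rho = rank_correlation(demo, 'GDP Per Capita', 'Gender Wage Gap in % Difference')
print(rho)
```

Applied to the whole of combination3, the same function would give a single figure to set beside the chart.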

Part 2 - Maternity Leave Data

For this data set, I will compare the length of maternity leave in weeks in a country with its gender wage gap. I would expect countries with shorter maternity leave to have higher gender wage gaps because this might indicate discrimination in the workplace. As a first step, I convert the country names to their ISO codes.


In [127]:
url = "http://unstats.un.org/unsd/methods/m49/m49alpha.htm"
iso = pd.read_html(url, attrs={"border": "0", "cellpadding": "2"}, header=0)[0]
iso = iso.rename(columns={"ISO ALPHA-3 code": "ISO", 
                          "Country or area name": "Country"})
iso = iso.drop("Numerical  code", axis=1)
iso = iso.set_index("Country")
iso.head()


Out[127]:
ISO
Country
Afghanistan AFG
Åland Islands ALA
Albania ALB
Algeria DZA
American Samoa ASM

In [128]:
file3 = 'data/Maternity Leave Data.xlsx'
ml = pd.read_excel(file3)
ml = ml[['Country', '2015']]
ml.columns = ['Country', 'Maternity Leave in Weeks']
ml = ml.drop(0).drop(36)
ml.head()


Out[128]:
Country Maternity Leave in Weeks
1 Australia 6.0
2 Austria 16.0
3 Belgium 15.0
4 Canada 17.0
5 Chile 18.0

In [129]:
iso_ml = iso[iso.index.isin(ml['Country'])]

iso_ml.head()


Out[129]:
ISO
Country
Australia AUS
Austria AUT
Belgium BEL
Canada CAN
Chile CHL

In [130]:
ml = ml.set_index('Country')

In [131]:
ml = ml.merge(iso, left_index=True, right_index =True)
ml.head()


Out[131]:
Maternity Leave in Weeks ISO
Country
Australia 6.0 AUS
Austria 16.0 AUT
Belgium 15.0 BEL
Canada 17.0 CAN
Chile 18.0 CHL

In [132]:
ml = ml.set_index('ISO')

In [133]:
combination4 = combination.merge(ml, left_index=True, right_index=True)
combination4.head()


Out[133]:
Gender Wage Gap in % Difference Maternity Leave in Weeks
ISO
AUS 14.042933 6.0
AUT 19.188862 16.0
BEL 7.043796 15.0
CAN 18.977470 17.0
CHL 23.231090 18.0

In [134]:
np.corrcoef(combination4['Maternity Leave in Weeks'], combination4['Gender Wage Gap in % Difference'])[0, 1]


Out[134]:
-0.076387039957295663

No figure was charted for this analysis because the correlation coefficient between maternity leave in weeks and the gender wage gap (about -0.08) is close to zero, indicating a very weak correlation. Maternity leave policy appears to have little correlation with the wage gap, so it does not by itself point to discrimination in this respect.

Part 3 - Segmenting the Gender Wage Gap in the US

The current gender wage gap in the US stands at nearly 20%. I will be segmenting this percentage by profession and by race.

For the BLS data set, I will extract a variety of professions, each of which has its own gender wage gap. I want to compare how wage gaps differ by industry and the reasons for these differences.


In [10]:
#US earnings by occupation and sex
file4 = 'http://www.bls.gov/cps/cpsaat39.xlsx'
us = pd.read_excel(file4)
us = us.drop([0, 1, 2, 3, 5])   # header rows that carry no data
us = us.drop(['Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 5'], axis=1)
us.columns = ['Occupation', 'Men median weekly earnings', 'Women median weekly earnings']
us.head()


Out[10]:
Occupation Men median weekly earnings Women median weekly earnings
4 NaN Median weekly earnings Median weekly earnings
6 Total, full-time wage and salary workers 895 726
7 NaN NaN NaN
8 Management, professional, and related occupations 1383 996
9 Management, business, and financial operations... 1436 1073

I extracted the men's and women's median weekly earnings for 12 occupation groups, which are spread across many different industries.


In [11]:
occupationlist = ['Management occupations',
                  'Business and financial operations occupations',
                  'Architecture and engineering occupations',
                  'Community and social service occupations',
                  'Legal occupations',
                  'Arts, design, entertainment, sports, and media occupations',
                  'Healthcare practitioners and technical occupations',
                  'Food preparation and serving related occupations',
                  'Sales and related occupations',
                  'Office and administrative support occupations',
                  'Construction and extraction occupations',
                  'Education, training, and library occupations']
us = us[us['Occupation'].isin(occupationlist)]
us


Out[11]:
Occupation Men median weekly earnings Women median weekly earnings
10 Management occupations 1486 1139
41 Business and financial operations occupations 1327 1004
88 Architecture and engineering occupations 1452 1257
134 Community and social service occupations 973 845
143 Legal occupations 1877 1135
149 Education, training, and library occupations 1144 907
161 Arts, design, entertainment, sports, and media... 1088 942
181 Healthcare practitioners and technical occupat... 1272 991
248 Food preparation and serving related occupations 481 414
292 Sales and related occupations 880 578
311 Office and administrative support occupations 693 646
376 Construction and extraction occupations 751 704

In [12]:
# cleaning up the data and making the names of the occupations shorter
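# map the long names to short ones (avoids the deprecated DataFrame.set_value)
short_names = {
    'Management occupations': 'Management',
    'Business and financial operations occupations': 'Business & financial ops',
    'Architecture and engineering occupations': 'Architecture and Engineering',
    'Community and social service occupations': 'Community & social service',
    'Legal occupations': 'Legal',
    'Education, training, and library occupations': 'Education',
    'Arts, design, entertainment, sports, and media occupations': 'Arts, entertainment, sports',
    'Healthcare practitioners and technical occupations': 'Healthcare',
    'Food preparation and serving related occupations': 'Food Prep',
    'Sales and related occupations': 'Sales',
    'Office and administrative support occupations': 'Office & admin support',
    'Construction and extraction occupations': 'Construction',
}
us = us.assign(Occupation=us['Occupation'].replace(short_names))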
us


Out[12]:
Occupation Men median weekly earnings Women median weekly earnings
10 Management 1486 1139
41 Business & financial ops 1327 1004
88 Architecture and Engineering 1452 1257
134 Community & social service 973 845
143 Legal 1877 1135
149 Education 1144 907
161 Arts, entertainment, sports 1088 942
181 Healthcare 1272 991
248 Food Prep 481 414
292 Sales 880 578
311 Office & admin support 693 646
376 Construction 751 704

In [13]:
us = us.set_index('Occupation')

In [14]:
men = dict(type="scatter", 
            name="Men", 
            mode="markers",                       # draw dots
            x=us["Men median weekly earnings"],                    # x data
            y=us.index,                    # y data
            marker={"color": "Blue", "size": 12}  # dot color/size
           )
women = dict(type="scatter", 
             name="Women", 
             mode="markers",
             x=us['Women median weekly earnings'], 
             y=us.index,
             marker={"color": "Pink", "size": 12}
            )

def draw_line(row):
    sc = row.name
    line = dict(type="scatter",                # trace type
                x=[row["Women median weekly earnings"], row["Men median weekly earnings"]],  # x data
                y=[sc, sc],                    # y data flat
                mode="lines",                  # draw line
                name=sc,                       # name trace
                showlegend=False,              # no legend entry
                line={"color": "gray"}         # line color
               )
    return line
lines = list(us.apply(draw_line, axis=1))

layout = dict(width=600, height=750,                            # plot width/height
              yaxis={"title": "Occupation"},                    # yaxis label
              title="Gender earnings disparity by profession",  # title
              xaxis={"title": "Median Weekly Earnings"})        # xaxis label
             
# use + for two lists
data = [men, women] + lines  

# build and display the figure
fig = go.Figure(data=data, layout=layout)
iplot(fig)


According to the graph above, the largest wage gap is in the legal industry. It could be due to the practice areas women take up in law. For example, women are more likely to practice family or employment law, both of which offer lower salaries than the areas of law that men are more likely to take up. However, some research suggests that women lawyers are paid less regardless of how much they work. Some attribute the problem to differences in negotiating power between women and men. Areas with a small wage gap include construction and admin support. The nature of these industries is such that they offer less flexibility in negotiating salaries.
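
The chart shows the gaps in dollars, but occupations with higher pay naturally have room for larger dollar gaps. As a rough check, the sketch below expresses each gap as a percentage of men's earnings, using the figures from Out[12]. The frame name `earnings` and the column `Gap %` are my own additions.

```python
import pandas as pd

# Median weekly earnings from Out[12]
earnings = pd.DataFrame(
    {'Men': [1486, 1327, 1452, 973, 1877, 1144, 1088, 1272, 481, 880, 693, 751],
     'Women': [1139, 1004, 1257, 845, 1135, 907, 942, 991, 414, 578, 646, 704]},
    index=['Management', 'Business & financial ops', 'Architecture and Engineering',
           'Community & social service', 'Legal', 'Education',
           'Arts, entertainment, sports', 'Healthcare', 'Food Prep', 'Sales',
           'Office & admin support', 'Construction'])

# Gap as a share of men's earnings, in percent
earnings['Gap %'] = 100 * (earnings['Men'] - earnings['Women']) / earnings['Men']
ranked = earnings.sort_values(by='Gap %', ascending=False)
print(ranked['Gap %'].round(1))
```

In percentage terms the ordering agrees with the reading of the chart: legal and sales are at the top, while construction and admin support are at the bottom.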

Part 3 - US Wage Gap by Race

For this BLS data set, I will extract the median weekly earnings of men and women by race. I want to compare how the gender wage gap differs between racial groups and the reasons for these differences.


In [18]:
file5 = 'http://www.bls.gov/cps/cpsaat37.xlsx'
demographics = pd.read_excel(file5)
# Drop header rows and rows that are not needed
rows_to_drop = list(range(17)) + [19, 20, 23, 24, 27, 28, 31, 32]
demographics = demographics.drop(rows_to_drop)
demographics = demographics.drop(['Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3'], axis=1)
demographics.columns = ['Demographics', 'Median weekly earnings']
demographics = demographics.drop('Demographics', axis=1)
# Keep the four rows whose earnings match the men's figures entered below
vlist4 = [920.0, 680.0, 1129.0, 631.0]
demographics = demographics[demographics['Median weekly earnings'].isin(vlist4)]
demographics['Race'] = ['White', 'Black or African American', 'Asian', 'Hispanic or Latino']
demographics['Women Median Weekly Earnings'] = [743.0, 615.0, 877.0, 556.0]
demographics['Men Median Weekly Earnings']= [920.0, 680.0, 1129.0, 631.0]
demographics = demographics.drop('Median weekly earnings', axis=1)
demographics = demographics.set_index('Race')
demographics


Out[18]:
Women Median Weekly Earnings Men Median Weekly Earnings
Race
White 743.0 920.0
Black or African American 615.0 680.0
Asian 877.0 1129.0
Hispanic or Latino 556.0 631.0

In [19]:
Women = dict(type="bar",                                      # trace type
           orientation="h",                                 # make bars horizontal
           name="Women",                                      # legend entry
           x=demographics["Women Median Weekly Earnings"],                               # x data
           y=demographics.index,                                # y data
           marker={"color": "Pink"}                         # pink bars
          )
Men = dict(type="bar",                                    # trace type
             orientation="h",                               # horizontal bars
             name="Men",                                  # legend entry
             x=demographics["Men Median Weekly Earnings"],                           # x data
             y=demographics.index,                              # y data
             marker={"color": "Blue"}                       # blue bars
            )
layout = dict(width=650, height=750,                        # plot width/height
              yaxis={"title": "Race"},                    # yaxis label
              title="Earnings by Gender and Race",            # title
              xaxis={"title": "Median Weekly Earnings"}  # xaxis label}
             )

iplot(go.Figure(data=[Men, Women], layout=layout))


According to the bar graph above, the largest gender wage gap among races is observed between Asian women and Asian men. Studies have shown that this wage gap can be attributed to discrimination in the workplace, especially against Asian women. Other contributing factors cited by Pew Research are risk aversion and negotiation skills, which Asian workers reportedly do not draw on as heavily as workers of other races. Interestingly, negotiation plays a significant role in explaining the wage gap, whether segmented by occupation or by race. Another theory is that women do not necessarily seek the same high-paying STEM fields as men because of experiences rooted in prejudice, such as stronger encouragement for men to pursue these professions. An important point to note is that both Asian men and Asian women make more money than their Black, Hispanic, or white counterparts. However, as Asian women climb the career ladder and earn higher salaries, there is far more discrimination present at the executive levels. This could account for the larger wage gap among Asians.
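
To back up the reading of the bar graph, here is a short calculation of each gap in dollars and as a percentage of men's earnings, using the figures from Out[18]. The dictionary names are made up for this sketch.

```python
# Median weekly earnings (women, men) from Out[18]
earnings_by_race = {
    'White': (743.0, 920.0),
    'Black or African American': (615.0, 680.0),
    'Asian': (877.0, 1129.0),
    'Hispanic or Latino': (556.0, 631.0),
}

# Gap in dollars and as a percentage of men's earnings
gaps = {
    race: (men - women, 100 * (men - women) / men)
    for race, (women, men) in earnings_by_race.items()
}

# Print the groups from largest to smallest percentage gap
for race, (dollars, percent) in sorted(gaps.items(), key=lambda kv: kv[1][1], reverse=True):
    print('{:<26} ${:>5.0f}  {:>5.1f}%'.format(race, dollars, percent))
```

The gap among Asian workers is the largest on both measures, which matches the observation above.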

Conclusion

Of the factors I compared with gender wage gaps around the world, only a country's score for women's economic rights seems to be correlated with its gender wage gap. This indicates that the gender wage gap is a highly complex phenomenon with no single, consistent cause. The gender wage gap is attributable to an individual country's culture, government, and economic opportunities.

Segmenting the wage gap in the United States allows us to observe where women are making headway and where there is still progress to be made. Professions such as food prep, construction, and admin support show smaller wage gaps, whereas legal and sales show larger ones. This points to the skill level, work culture, and long-lasting prejudices embedded within these professions. In terms of race, while women on average are earning more than they did in the past, high barriers still exist at more senior and executive levels.

Overall, this analysis demonstrates that women are certainly making headway in closing the gender wage gap but there is still room for improvement.

